library(CTexploreR)
library(tidyverse)
library(SummarizedExperiment)
library(ggrepel)

Existing CT databases

1. CTdatabase

CTdatabase

CAVEATS:

  • Problematic gene names, no precise chromosomal location, not an importable format…

  • Expression profiles based on old methods (no RNAseq data)

  • No data about DNA methylation

  • Recently identified CT genes not present in the database

2. CT list from Wang’s paper

CT Selection procedure:

CAVEATS:

  • Just a list of gene in supplemental data

  • Selection criteria not stringent (many “CT genes” are not really testis-specific)

  • Based on RPKM values (GTEX v6 and TCGA), not TPM

  • Multi-mapping problem not well addressed

  • No individual data about promoter DNA methylation

CTexploreR CT gene selection

CT genes table

DT::datatable(CT_genes)

Detailed selection procedure

Testis-specific expression

Testis-specific genes (expressed exclusively in testis) and testis-preferential genes (expressed in a few somatic tissues at a level lower than 10x testis expression) were first selected using GTEx database.

Note that some genes were undetectable in GTEx database due to multimapping issues (these were flagged as “lowly expressed” in GTEX_category column). A careful inspection of these genes showed that many of them are well-known Cancer-Germline genes, belonging to gene families (MAGEA, SSX, CT45A, GAGE, …) from which members have identical or nearly identical sequences. This is likely the reason why these genes are not detected in GTEx database, as GTEX processing pipeline specifies that overlapping intervals between genes are excluded from all genes for counting.

For these genes, as testis-specificy could not be assessed using GTEx database, RNAseq data from a set of normal tissues were home-processed in order to allow multimapping. Expression of several genes became detectable in some tissues. Genes showing a testis-specific expression (expression at least 10x higher in testis than in any somatic tissues when multimapping was allowed) were selected, and flagged as testis-specific in multimapping_analysis column.

Activation in cancer cell lines

RNAseq data from cancer cell lines from CCLE were used to select among testis-specific and testis-prefential genes those that are activated in cancers. In the CCLE_category column, genes are tagged as “activated” when they are highly expressed in at least one cancer cell line (TPM >= 10). But genes that were found to be expressed in all -or almost all- cancer cell lines were removed, as this probably reflects a constitutive expression rather than a true activation. We filtered out genes that were not completely repressed in at least 20 % of cell lines (TPM <= 0.1).

Activation in TCGA cancers

RNAseq data from TCGA cancer samples were used to select among testis-specific and testis-prefential genes those that are activated in cancers. In the TCGA_category column, genes are tagged as “activated” when they are highly expressed in at least one cancer sample (TPM >= 10). But genes that were found to be expressed in all -or almost all- cancer cell lines were removed, as this probably reflects a constitutive expression rather than a true activation. We filtered out genes that were not completely repressed in at least 20 % of cancer samples (TPM <= 0.1).

IGV visualisation

All selected CT genes were visualised on IGV using a RNA-seq alignment from testis, to ensure that expression in testis really corrresponded to the canonical transcript. For some genes for which the canonical transcript did not correspond to the transcript that we could see in the testis sample, we modified manually the external_transcript_name accordingly, to ensure that the TSS and the promoter region are correctly defined. This is particularly important for methylation analyses that must be focused on true promoter regions.

Example of gene were the canonical transcript was changed manually, according to transcript structure observed in testis:

Few genes were also removed at that step, as reads were not aligned properly to any referenced transcript (example below).

CT genes controled by methylation

Genes flagged as TRUE in regulated_by_methylation column correspond to

  • Genes that are significantly induced by a demethylating agent (RNAseq analysis of cell lines reated with DAC (5-Aza-2′-Deoxycytidine)).

  • Genes that have a highly methylated promoter in normal somatic tissues but less methylated in germ cells (WGBS analysis of a set of normal tissues).

For some genes showing a strong activation in cells treated with 5-Aza-2′-Deoxycytidine, methylation analysis was not possible due to multimapping issues. In this case, genes were still considered as regulated by methylation unless their promoter appeared unmethylated in somatic tissues or methylated in germ cells.

controled_by_methylation <- CT_genes %>% 
  filter(regulated_by_methylation == TRUE)

not_controled_by_methylation <- CT_genes %>% 
  filter(regulated_by_methylation == FALSE)
  • 160 genes appear to be controled by methylation

  • 147 genes appear to be controled by methylation

CTexploreR functions

Expression in normal tissues

GTEX_expression()

Allows to visualise gene expression in GTEx tissues.

  • Applied to testis-specific genes
testis_specific <- CT_genes %>% 
  filter(testis_specificity == "testis_specific")
GTEX_expression(testis_specific$external_gene_name, units = "log_TPM")

  • Applied to testis-preferential genes
testis_preferential <- CT_genes %>% 
  filter(testis_specificity == "testis_preferential")
GTEX_expression(testis_preferential$external_gene_name, units = "log_TPM")

normal_tissue_multimapping_heatmap()

Allows to visualise the expression values obtained by counting or not multi-mapped reads in normal tissues.

  • Applied to CT_genes not expressed in GTEx database
testis_specific_in_multimapping_analysis <- CT_genes %>% 
  filter(lowly_expressed_in_GTEX)

multimapping option set to FALSE

normal_tissue_expression_multimapping(testis_specific_in_multimapping_analysis$external_gene_name, 
                                      multimapping = FALSE, units = "log_TPM")

multimapping option set to TRUE

normal_tissue_expression_multimapping(testis_specific_in_multimapping_analysis$external_gene_name, 
                                      multimapping = TRUE, units = "log_TPM")

Expression in cancer cells

CCLE_expression()

Allows to visualise gene expression in different histological types of CCLE cancer cell lines.

  • Applied to all CT_genes
CCLE_expression(genes = CT_genes$external_gene_name, 
                type = c("Skin", "Lung", "Esophageal", "Head_and_Neck", "Leukemia"), 
                units = "log_TPM")

  • Applied to testis preferential genes
CCLE_expression(genes = testis_preferential$external_gene_name, 
                type = c("Skin", "Lung", "Esophageal", "Head_and_Neck", "Leukemia"), 
                units = "log_TPM")

  • Applied to CT genes that are frequenty activated in CCLE cell lines

One can select genes base on activation frequencies in CCLE cell lines using the percent_of_positive_CCLE_cell_lines column of CT_genes.

frequently_activated <- CT_genes %>% 
  filter(percent_of_positive_CCLE_cell_lines >= 5)
CCLE_expression(genes = frequently_activated$external_gene_name, 
                type = c("Skin", "Lung", "Esophageal", "Head_and_Neck", "Leukemia"), 
                units = "log_TPM")

correlated_genes()

A function that use expression data from all CCLE cell lines and returns genes correlated (or anti-correlated) with specified CT gene.

correlated_genes("MAGEA3", 0.3)

correlated_genes("CT83", 0.3)

TCGA_expression()

Allows to visualise gene expression in cancer samples from TCGA (in ‘SKCM’, ‘LUAD’, ‘LUSC’, ‘COAD’, ‘ESCA’, ‘BRCA’, or ‘HNSC’ tumor types).

  • Applied to CT genes that are frequently activated in CCLE cell lines
TCGA_expression(genes = frequently_activated$external_gene_name, 
                tumor = "all", 
                units = "log_TPM")

TCGA_expression(genes = frequently_activated$external_gene_name, 
                tumor = "BRCA", 
                units = "log_TPM")

TCGA_expression(genes = frequently_activated$external_gene_name, 
                tumor = "LUAD", 
                units = "log_TPM")

Methylation analysis

DAC_induction()

Allows to visualise gene induction upon DAC treatment in a series of cell lines.

  • Applied to genes controled by methylation
DAC_induction(genes = controled_by_methylation$external_gene_name)

  • Applied to genes not controled by methylation
DAC_induction(genes = not_controled_by_methylation$external_gene_name)

normal_tissues_mean_methylation()

Gives the mean methylation level of CpGs located in each promoter region (defined as 1000 nt upstream TSS and 200 nt downstream TSS) in a set of normal tissues.

  • Applied to genes controled by methylation
normal_tissues_mean_methylation(genes = controled_by_methylation$external_gene_name)

  • Applied to genes not controled by methylation
normal_tissues_mean_methylation(genes = not_controled_by_methylation$external_gene_name)

Note that some genes are demethylated in testis but not (or extremely weakly!) induced by 5-aza.

=> Are considered as not regulated by methylation.

genes <- not_controled_by_methylation %>% 
  filter(somatic_methylation == TRUE, germline_methylation == FALSE) %>% 
  pull(external_gene_name)
normal_tissues_mean_methylation(genes = genes)

DAC_induction(gene = genes)

! Some genes are already expressed in all control cell lines!

=> not a good model to test DAC induction for these genes…

NB: Tested their expression in HCT166 DKO, none are highly induced. (NB: few MAGE genes were added as comparison points)

OPEN QUESTION: should we include HCT116 DKO data?

load("/data/cbio/CBIOProjects/202210_CBIO_CTexploreR/datasets/HCT116DKO/RNAseq/processed_data/count_matrix/dds_MP.rda")
ensembl_ids <- CT_genes %>% 
  filter(external_gene_name %in% genes) %>%
  pull(ensembl_gene_id)
MAGE_ids <- CT_genes %>%
  filter(external_gene_name %in% c("MAGEA6", "MAGEA1", "MAGEA4", "MAGEC1")) %>%
  pull(ensembl_gene_id)
as_tibble(round(counts(dds[c(ensembl_ids, MAGE_ids),], normalize = TRUE), 2), rownames = "ensembl_gene_id") %>% 
  left_join(CT_genes %>%
              select(external_gene_name, ensembl_gene_id)) %>% 
  select(external_gene_name, everything()) %>% 
  select(-ensembl_gene_id) %>% 
  arrange(desc(SRR12849891)) %>% 
  DT::datatable()
## Joining, by = "ensembl_gene_id"
# ensembl_ids <- CT_genes %>% filter(external_gene_name %in% c("LIN28A", "LIN28B", "CT83")) %>% pull(ensembl_gene_id)
# MAGE_ids <- CT_genes %>% filter(external_gene_name %in% c("MAGEA6", "MAGEA1", "MAGEA4", "MAGEC1")) %>% pull(ensembl_gene_id)
# as_tibble(round(counts(dds[c(ensembl_ids, MAGE_ids),], normalize = TRUE), 2), rownames = "ensembl_gene_id") %>% 
#   left_join(CT_genes %>% select(external_gene_name, ensembl_gene_id)) %>% 
#   select(external_gene_name, everything()) %>% 
#   select(-ensembl_gene_id) %>% 
#   arrange(desc(SRR12849891)) %>% 
#   DT::datatable()

normal_tissues_methylation()

Can be used to analyse more precisely the methylation of a particular promoter in normal tissues as methylation values are given CpG by CpG.

  • Applied to a gene controled by methylation
normal_tissues_methylation("MAGEB2")

normal_tissues_methylation("LIN28B")

  • Applied to a gene not controled by methylation
normal_tissues_methylation("LIN28A")

TCGA_methylation_expression_correlation()

Shows the correlation between gene expression and promoter methylation in TCGA samples.

  • Applied to a gene controled by methylation
TCGA_methylation_expression_correlation(tumor = "LUAD", 
                                        gene = "TDRD1")

TCGA_methylation_expression_correlation(tumor = "all", 
                                        gene = "TDRD1")

  • Applied to a gene not controled by methylation
TCGA_methylation_expression_correlation(tumor = "all", 
                                        gene = "LIN28A")

Session info

sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] ggrepel_0.9.2               SummarizedExperiment_1.26.1
##  [3] Biobase_2.56.0              GenomicRanges_1.48.0       
##  [5] GenomeInfoDb_1.32.4         IRanges_2.30.1             
##  [7] S4Vectors_0.34.0            BiocGenerics_0.42.0        
##  [9] MatrixGenerics_1.8.1        matrixStats_0.62.0         
## [11] forcats_0.5.2               stringr_1.4.1              
## [13] dplyr_1.0.10                purrr_1.0.1                
## [15] readr_2.1.3                 tidyr_1.2.1                
## [17] tibble_3.1.8                ggplot2_3.4.0              
## [19] tidyverse_1.3.2             CTexploreR_0.1.0           
## 
## loaded via a namespace (and not attached):
##   [1] googledrive_2.0.0      colorspace_2.0-3       rjson_0.2.21          
##   [4] ellipsis_0.3.2         circlize_0.4.15        XVector_0.36.0        
##   [7] GlobalOptions_0.1.2    fs_1.5.2               clue_0.3-62           
##  [10] rstudioapi_0.14        farver_2.1.1           bit64_4.0.5           
##  [13] DT_0.26                AnnotationDbi_1.58.0   fansi_1.0.3           
##  [16] lubridate_1.8.0        xml2_1.3.3             splines_4.2.0         
##  [19] codetools_0.2-18       doParallel_1.0.17      cachem_1.0.6          
##  [22] geneplotter_1.74.0     knitr_1.40             jsonlite_1.8.3        
##  [25] broom_1.0.1            annotate_1.74.0        cluster_2.1.4         
##  [28] dbplyr_2.2.1           png_0.1-7              compiler_4.2.0        
##  [31] httr_1.4.4             backports_1.4.1        assertthat_0.2.1      
##  [34] Matrix_1.5-1           fastmap_1.1.0          gargle_1.2.1          
##  [37] cli_3.6.0              htmltools_0.5.3        tools_4.2.0           
##  [40] gtable_0.3.1           glue_1.6.2             GenomeInfoDbData_1.2.8
##  [43] Rcpp_1.0.9             cellranger_1.1.0       jquerylib_0.1.4       
##  [46] Biostrings_2.64.1      vctrs_0.5.1            iterators_1.0.14      
##  [49] crosstalk_1.2.0        xfun_0.34              rvest_1.0.3           
##  [52] lifecycle_1.0.3        XML_3.99-0.12          googlesheets4_1.0.1   
##  [55] zlibbioc_1.42.0        scales_1.2.1           hms_1.1.2             
##  [58] parallel_4.2.0         RColorBrewer_1.1-3     ComplexHeatmap_2.12.1 
##  [61] yaml_2.3.6             memoise_2.0.1          sass_0.4.2            
##  [64] RSQLite_2.2.18         stringi_1.7.8          highr_0.9             
##  [67] genefilter_1.78.0      foreach_1.5.2          BiocParallel_1.30.4   
##  [70] shape_1.4.6            rlang_1.0.6            pkgconfig_2.0.3       
##  [73] bitops_1.0-7           evaluate_0.17          lattice_0.20-45       
##  [76] htmlwidgets_1.5.4      labeling_0.4.2         bit_4.0.4             
##  [79] tidyselect_1.2.0       magrittr_2.0.3         DESeq2_1.36.0         
##  [82] R6_2.5.1               generics_0.1.3         DelayedArray_0.22.0   
##  [85] DBI_1.1.3              pillar_1.8.1           haven_2.5.1           
##  [88] withr_2.5.0            survival_3.4-0         KEGGREST_1.36.3       
##  [91] RCurl_1.98-1.9         modelr_0.1.9           crayon_1.5.2          
##  [94] utf8_1.2.2             tzdb_0.3.0             rmarkdown_2.17        
##  [97] GetoptLong_1.0.5       locfit_1.5-9.6         grid_4.2.0            
## [100] readxl_1.4.1           blob_1.2.3             reprex_2.0.2          
## [103] digest_0.6.30          xtable_1.8-4           munsell_0.5.0         
## [106] bslib_0.4.0